Abstract: A comparative sentence expresses an ordering relation between two sets of entities with respect
to some common features. For example, the comparative sentence "Canon's optics are better than those of
Sony and Nikon" expresses the comparative relation (better, {optics}, {Canon}, {Sony, Nikon}). Comparing
one thing with another is a typical part of the human decision-making process. However, it is not always
easy to know what to compare and what the alternatives are. To address this difficulty, we present a new
way of automatically extracting comparable entities from comparative questions based on sequential
patterns. We propose new techniques based on two types of sequential rules, class sequential rules and
label sequential rules, to perform the tasks.

Introduction

Comparing alternative options is one of the essential steps in the decision-making that we carry out every
day. For example, if someone is interested in certain products such as digital cameras, he or she would want
to know what the alternatives are and compare different cameras before making a purchase. This type of
comparison activity is very common in our daily life but requires considerable knowledge and skill.
Magazines such as Consumer Reports and PC Magazine and online media such as CNet.com strive to provide
editorial comparison content and surveys to satisfy this need. In the World Wide Web era, a comparison
activity typically involves searching for relevant web pages containing information about the targeted
products, finding competing products, reading reviews, and identifying pros and cons. In this paper, we
focus on finding a set of comparable entities given a user's input entity. For example, given an entity,
Nokia N95 (a cellphone), we want to find comparable entities such as Nokia N82, iPhone and so on. In
general, it is difficult to decide if two entities are comparable or not, since people do compare apples and
oranges for various reasons. For
example, "Ford" and "BMW" might be comparable as "car manufacturers" or as "market segments that their 
products are targeting", but we rarely see people comparing "Ford Focus" (car model) and "BMW 328i".
Things also get more complicated when an entity has several functionalities. For example, one might
compare "iPhone" and "PSP" as "portable game player" while comparing "iPhone" and "Nokia N95" as "mobile
phone". Fortunately, plenty of comparative questions are posted online, which provide evidence for what
people want to compare, e.g. "Which to buy, iPod or iPhone?". We call "iPod" and "iPhone" in this example
comparators. In this paper, we define comparative questions and comparators as:

Comparative question: A question that intends to compare two or more entities and has to mention
these entities explicitly in the question.

Comparator: An entity which is a target of comparison in a comparative question.

Comparisons can be subjective or objective. For example, a typical opinion sentence is "The picture quality
of camera x is great." A subjective comparison is "the picture quality of camera x is better than that of camera
y." An objective comparison is "car x is 2 feet longer than car y". We can see that comparative sentences use
different language constructs from typical opinion sentences (although the first comparative sentence
above is also an opinion). In this paper, we study the problem of comparative sentence mining. It has two
tasks:

1. Given a set of evaluative texts, identify comparative sentences from them, and classify the 
identified comparative sentences into different types (or classes). 

2. Extract comparative relations from the identified sentences. This involves the extraction of entities 
and their features that are being compared, and comparative keywords. The relation is expressed 
with (<relationWord>, <features>, <entityS1>, <entityS2>)
For example, we have the comparative sentence "Canon's optics are better than those of Sony and Nikon." The
extracted relation is: (better, {optics}, {Canon}, {Sony, Nikon})
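
As a minimal sketch, the relation tuple could be held in a small Python structure. The class name `ComparativeRelation` and its field names are illustrative assumptions and do not come from the original work.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class ComparativeRelation:
    """Hypothetical container for (<relationWord>, <features>, <entityS1>, <entityS2>)."""
    relation_word: str
    features: FrozenSet[str]
    entity_s1: FrozenSet[str]
    entity_s2: FrozenSet[str]

    def as_tuple(self) -> Tuple[str, List[str], List[str], List[str]]:
        # Sort the set members so that the printed form is deterministic.
        return (self.relation_word, sorted(self.features),
                sorted(self.entity_s1), sorted(self.entity_s2))


# The relation extracted from "Canon's optics are better than those of Sony and Nikon."
relation = ComparativeRelation(
    relation_word="better",
    features=frozenset({"optics"}),
    entity_s1=frozenset({"Canon"}),
    entity_s2=frozenset({"Sony", "Nikon"}),
)
print(relation.as_tuple())
```

Sets are used for the features and both entity groups because the relation compares two sets of entities, and their internal order carries no meaning.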

Both tasks are very challenging. Although the above sentences all contain some indicators, e.g.,
"better" and "longer", many sentences that contain such words are not comparatives, e.g., "I cannot agree with
you more". The second task is a difficult information extraction problem. For the first task, we present an
approach that integrates class sequential rules (CSR) and naive Bayesian classification.
This task is studied in detail in (Jindal & Liu 2006). We include it for completeness. For the second task, a
new type of rules called label sequential rules (LSR) is proposed for extraction. Our results show that LSRs
outperform Conditional Random Fields (CRF) (Lafferty, McCallum & Pereira 2001), which is perhaps the
most effective extraction method so far (Mooney & Bunescu 2005).

Types of Sequential Rules

We now present the proposed techniques, which are based on two types of sequential rules. Mining
of such rules is related to mining of sequential patterns (SPM) (Agrawal and Srikant 1994). Given a set of
input sequences, SPM finds all subsequences (called sequential patterns) that satisfy a user-specified
minimum support threshold. Below, we first explain some notations, and then define the two new types of
rules: class sequential rules (CSR), used in classification of sentences, and label sequential rules (LSR), used in
relation item extraction. For more details about these types of rules and their mining algorithms, please see
(Liu 2006).

Let $I = \{i_1, i_2, \dots, i_n\}$ be a set of items. A sequence is an ordered list of itemsets. An itemset $X$ is a non-empty
set of items. We denote a sequence $s$ by $\langle a_1 a_2 \dots a_r \rangle$, where $a_i$ is an itemset, also called an element of $s$. We
denote an element of a sequence by $\{x_1, x_2, \dots, x_k\}$, where $x_j$ is an item. An item can occur only once in an
element of a sequence, but can occur multiple times in different elements. A sequence $s_1 = \langle a_1 a_2 \dots a_r \rangle$ is a
subsequence of another sequence $s_2 = \langle b_1 b_2 \dots b_m \rangle$, or $s_2$ is a supersequence of $s_1$, if there exist integers
$1 \le j_1 < j_2 < \dots < j_r \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \dots, a_r \subseteq b_{j_r}$. We also say that $s_2$ contains $s_1$.
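
The subsequence definition can be sketched in a few lines of Python. This is only an illustration: the function name `is_subsequence` and the toy sequences are assumptions, and itemsets are modeled as Python sets. A greedy left-to-right scan is sufficient, because matching each element at the earliest possible position never rules out a later match.

```python
from typing import Sequence, Set


def is_subsequence(s1: Sequence[Set[int]], s2: Sequence[Set[int]]) -> bool:
    """Return True if s1 is a subsequence of s2, i.e. s2 contains s1."""
    j = 0
    for element in s1:
        # Advance to the first remaining element of s2 that is a superset of `element`.
        while j < len(s2) and not element <= s2[j]:
            j += 1
        if j == len(s2):
            return False
        j += 1  # Later elements of s1 must match strictly later elements of s2.
    return True


# Toy sequences, invented for illustration.
s2 = [{6}, {3, 7}, {9}, {4, 5, 8}, {3, 8}]
print(is_subsequence([{3}, {4, 5}], s2))     # {3} in {3, 7}, then {4, 5} in {4, 5, 8}
print(is_subsequence([{3, 8}, {4, 5}], s2))  # {3, 8} only matches the last element
```
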
Class Sequential Rules



Let $S$ be a set of data sequences. Each sequence is labeled with a class $y$. Let $Y$ be the set of all classes, $I \cap Y = \emptyset$.
Thus, the input data $D$ for mining is represented with $D = \{(s_1, y_1), (s_2, y_2), \dots, (s_n, y_n)\}$, where $s_i$ is a
sequence and $y_i \in Y$ is its class. A class sequential rule (CSR) is an implication of the form $X \rightarrow y$, where $X$ is a
sequence, and $y \in Y$. A data instance $(s_i, y_i)$ in $D$ is said to cover the CSR if $X$ is a subsequence of $s_i$. A data
instance $(s_i, y_i)$ is said to satisfy a CSR if $X$ is a subsequence of $s_i$ and $y_i = y$. The support (sup) of the rule is
the fraction of total instances in $D$ that satisfy the rule. The confidence (conf) of the rule is the proportion
of instances in $D$ covering the rule that also satisfy it. Given a labeled sequence data set $D$, a minimum
support (minsup) and a minimum confidence (minconf) threshold, CSR mining finds all class sequential
rules in $D$ that meet both thresholds.
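
To make the support and confidence definitions concrete, the following sketch computes both for a single candidate rule over a toy data set. The data, the helper names `contains` and `csr_support_confidence`, and the class labels are invented for illustration. The sketch only evaluates one rule and does not implement the CSR mining algorithm itself.

```python
from fractions import Fraction


def contains(sequence, pattern):
    """Greedy check that `pattern` is a subsequence of `sequence` (lists of sets)."""
    j = 0
    for element in pattern:
        while j < len(sequence) and not element <= sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def csr_support_confidence(data, pattern, label):
    """Return (support, confidence) of the rule pattern -> label over `data`."""
    covered = [y for s, y in data if contains(s, pattern)]  # instances covering the rule
    satisfied = sum(1 for y in covered if y == label)       # covering and same class
    support = Fraction(satisfied, len(data))
    confidence = Fraction(satisfied, len(covered)) if covered else Fraction(0)
    return support, confidence


# Toy labeled data set D, invented for illustration.
D = [
    ([{1}, {3}, {5}, {7, 8, 9}], "c1"),
    ([{1}, {3}, {6}, {7, 8}], "c1"),
    ([{1, 6}, {9}], "c2"),
    ([{3}, {5, 6}], "c2"),
    ([{1}, {3}, {4}, {7, 8}], "c2"),
]
sup, conf = csr_support_confidence(D, [{1}, {3}, {7, 8}], "c1")
print(sup, conf)  # three instances cover the rule, two of them satisfy it
```

A mining procedure would keep only the rules whose support and confidence reach minsup and minconf.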




Label Sequential Rules 



A label sequential rule (LSR) is of the following form,

$X \rightarrow Y$,

where Y is a sequence and X is a sequence produced from Y by replacing some of its items with wildcards. A 
wildcard, denoted by a '*', matches any item. The definitions of support and confidence are similar to those 
above. The input data is a set of sequences, called data sequences. 
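
The sketch below shows one plausible reading of these definitions; it is an assumption, not the authors' implementation. A data sequence is taken to cover the rule if X (with wildcards) matches it and to satisfy the rule if Y is contained in it. Inside an element, each wildcard is assumed to stand for one distinct item other than the fixed ones. All names and data are hypothetical.

```python
from fractions import Fraction

WILDCARD = "*"


def element_matches(pattern_element, data_element):
    """A pattern element (a list, possibly holding several '*') matches a data itemset
    if its fixed items are present and enough other items remain for the wildcards."""
    fixed = {x for x in pattern_element if x != WILDCARD}
    wildcards = sum(1 for x in pattern_element if x == WILDCARD)
    return fixed <= data_element and len(data_element) - len(fixed) >= wildcards


def matches(pattern, sequence):
    """Greedy subsequence test using wildcard-aware element matching."""
    j = 0
    for element in pattern:
        while j < len(sequence) and not element_matches(element, sequence[j]):
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def lsr_support_confidence(data, x, y):
    """Return (support, confidence) of the rule x -> y over the data sequences."""
    covered = [s for s in data if matches(x, s)]
    satisfied = [s for s in covered if matches(y, s)]
    support = Fraction(len(satisfied), len(data))
    confidence = Fraction(len(satisfied), len(covered)) if covered else Fraction(0)
    return support, confidence


# Toy data sequences, invented for illustration.
data = [
    [{1}, {3}, {5}, {7, 8, 9}],
    [{1}, {3}, {6}, {7, 8}],
    [{1, 6}, {9}],
    [{3}, {5, 6}],
    [{1}, {3}, {4}, {7, 9}],
]
rule_x = [[1], [3], ["*", "*"]]
rule_y = [[1], [3], [7, 8]]
sup, conf = lsr_support_confidence(data, rule_x, rule_y)
print(sup, conf)
```
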
Mining Indicative Extraction Patterns

Our weakly supervised indicative extraction pattern (IEP) mining approach is based on two key assumptions:

• If a sequential pattern can be used to extract many reliable comparator pairs, it is very likely to be 
an IEP. 

• If a comparator pair can be extracted by an IEP, the pair is reliable. 
Based on these two assumptions, we design our bootstrapping algorithm. 



The bootstrapping process starts with a single IEP. From it, we extract a set of initial seed comparator pairs.
For each comparator pair, all questions containing the pair are retrieved from a question collection and
regarded as comparative questions. From the comparative questions and comparator pairs, all possible
sequential patterns are generated and evaluated by measuring their reliability score, explained later in the
Pattern Evaluation section. Patterns evaluated as reliable are IEPs and are added to an IEP
repository. Then, new comparator pairs are extracted from the question collection using the latest IEPs. The
new comparators are added to a reliable comparator repository and used as new seeds for pattern learning
in the next iteration. All questions from which reliable comparators are extracted are removed from the
collection to allow finding new patterns efficiently in later iterations. The process iterates until no more
new patterns can be found from the question collection.
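
The loop described above can be sketched as follows. This is a heavily simplified illustration under stated assumptions: patterns are whole-question templates with `$C` slots, comparators are single words, and the reliability score is replaced by a placeholder count threshold (`min_support`). All function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    # Each '$C' slot becomes a one-word capture group; the rest is literal text.
    literal_parts = [re.escape(part) for part in pattern.split("$C")]
    return re.compile("^" + r"(\w+)".join(literal_parts) + "$")


def extract_pair(pattern, question):
    match = pattern_to_regex(pattern).match(question)
    if match and len(match.groups()) == 2:
        return match.groups()
    return None


def candidate_pattern(question, pair):
    # Replace both comparators with '$C' if the question mentions the pair.
    tokens = question.split()
    if pair[0] in tokens and pair[1] in tokens:
        return " ".join("$C" if t in pair else t for t in tokens)
    return None


def bootstrap(questions, seed_pattern, min_support=1):
    remaining = list(questions)
    ieps = [seed_pattern]
    reliable_pairs = set()
    latest = [seed_pattern]
    while latest:
        # 1. Extract pairs with the latest IEPs and drop the matched questions.
        unmatched = []
        for q in remaining:
            pairs = [p for p in (extract_pair(pat, q) for pat in latest) if p]
            if pairs:
                reliable_pairs.update(pairs)
            else:
                unmatched.append(q)
        remaining = unmatched
        # 2. Generate candidates from remaining questions containing a reliable pair.
        counts = {}
        for q in remaining:
            for pair in reliable_pairs:
                cand = candidate_pattern(q, pair)
                if cand and cand not in ieps:
                    counts[cand] = counts.get(cand, 0) + 1
        # 3. Placeholder reliability test: keep sufficiently frequent candidates.
        latest = [c for c, n in counts.items() if n >= min_support]
        ieps.extend(latest)
    return ieps, reliable_pairs


questions = [
    "ipod vs zune",
    "which is better ipod or zune",
    "which is better canon or nikon",
]
ieps, pairs = bootstrap(questions, "$C vs $C")
print(ieps)
print(pairs)
```

In this toy run the seed pattern yields the pair (ipod, zune), which leads to the second pattern, which in turn extracts (canon, nikon). The loop stops when an iteration produces no new pattern.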



Pattern Generation

To generate sequential patterns, we adapt the surface text pattern mining method introduced in
(Ravichandran and Hovy, 2002). For any given comparative question and its comparator pairs, comparators
in the question are replaced with the symbol $C. Two symbols, #start and #end, are attached to the beginning
and the end of a sentence in the question. Then, the following three kinds of sequential patterns are
generated from sequences of questions:
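
The preprocessing step just described can be illustrated with a short sketch. It assumes the question is already tokenized and that each comparator is a single token; multi-word comparators would need span replacement. The function name `to_symbol_sequence` is hypothetical.

```python
def to_symbol_sequence(tokens, comparators):
    """Replace comparator tokens with '$C' and add the boundary symbols."""
    body = ["$C" if token in comparators else token for token in tokens]
    return ["#start"] + body + ["#end"]


tokens = ["which", "to", "buy", ",", "ipod", "or", "iphone", "?"]
sequence = to_symbol_sequence(tokens, {"ipod", "iphone"})
print(sequence)
```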

Lexical patterns: Lexical patterns are sequential patterns consisting of only words and symbols ($C,
#start, and #end). They are generated by the suffix tree algorithm (Gusfield, 1997) with two constraints: a
pattern should contain more than one $C, and its frequency in the collection should be more than an
empirically determined number β.
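
The two constraints on lexical patterns can be illustrated with a brute-force sketch. It does not use a suffix tree: it simply enumerates every contiguous token span, which is only practical for small inputs. Counting a pattern at most once per question is an assumption, and the name `lexical_patterns` is hypothetical.

```python
from collections import Counter


def lexical_patterns(sequences, beta):
    """Return contiguous patterns with more than one '$C' and frequency above beta."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                span = tuple(seq[i:j])
                if span.count("$C") > 1:
                    seen.add(span)
        counts.update(seen)  # each pattern is counted once per question
    return {pattern for pattern, n in counts.items() if n > beta}


sequences = [
    ["#start", "$C", "or", "$C", "?", "#end"],
    ["#start", "which", "is", "better", ",", "$C", "or", "$C", "?", "#end"],
]
patterns = lexical_patterns(sequences, beta=1)
for p in sorted(patterns, key=len):
    print(" ".join(p))
```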

Generalized patterns: A lexical pattern can be too specific. Thus, we generalize lexical patterns by
replacing one or more words with their POS tags. 2^N - 1 generalized patterns can be produced from a lexical
pattern containing N words excluding $Cs.
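
The count of 2^N - 1 follows from choosing any non-empty subset of the N words to replace. A minimal sketch, assuming the POS tags are supplied by hand (no tagger is called) and that `$C` slots carry the tag `None` so they are never replaced. The function name and the example pattern are hypothetical.

```python
from itertools import combinations


def generalized_patterns(tokens, pos_tags):
    """Replace every non-empty subset of tagged words with their POS tags."""
    word_positions = [i for i, tag in enumerate(pos_tags) if tag is not None]
    results = []
    for k in range(1, len(word_positions) + 1):
        for chosen in combinations(word_positions, k):
            results.append(tuple(
                pos_tags[i] if i in chosen else token
                for i, token in enumerate(tokens)
            ))
    return results


tokens = ["is", "$C", "better", "than", "$C"]
pos_tags = ["VBZ", None, "JJR", "IN", None]  # hand-assigned tags, for illustration only
generalized = generalized_patterns(tokens, pos_tags)
print(len(generalized))  # N = 3 words, so 2**3 - 1 patterns
```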

Specialized patterns: In some cases, a pattern can be too general. For example, although a question "ipod
or zune?" is comparative, the pattern "<$C or $C>" is too general, and there can be many noncomparative
questions matching the pattern, for instance, "true or false?". For this reason, we perform pattern
specialization by adding POS tags to all comparator slots. For example, from the lexical pattern "<$C or
$C>" and the question "ipod or zune?", "<$C/NN or $C/NN?>" will be produced as a specialized pattern.
Note that generalized patterns are generated from lexical patterns and the specialized patterns are
generated from the combined set of generalized patterns and lexical patterns. The final set of candidate
patterns is a mixture of lexical patterns, generalized patterns and specialized patterns.
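
Pattern specialization can be sketched as attaching a tag to each comparator slot. The tags are passed in by hand here; in practice they would come from a POS tagger applied to the comparators in the question. The function name `specialize` is hypothetical.

```python
def specialize(pattern_tokens, comparator_tags):
    """Attach one POS tag to each '$C' slot, in order of appearance."""
    if pattern_tokens.count("$C") != len(comparator_tags):
        raise ValueError("need exactly one tag per $C slot")
    tags = iter(comparator_tags)
    return [f"{tok}/{next(tags)}" if tok == "$C" else tok for tok in pattern_tokens]


# "ipod or zune?": both comparators are assumed to be tagged NN.
specialized = specialize(["$C", "or", "$C", "?"], ["NN", "NN"])
print(" ".join(specialized))
```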

Conclusion 

This paper studied the new problem of identifying comparative sentences in evaluative texts, and extracting 
comparative relations from them. Two techniques were proposed to perform the tasks, based on class 
sequential rules and label sequential rules, which give us syntactic clues of comparative relations. 
Experimental results show that these methods are quite promising. 